“In this course we will learn to understand the principles and advantages of using open research tools with open data and understand the possibilities of reproducible research.”
(https://sofiaoja.github.io/IODS-project/)
This week I have learned Data Wrangling and Analysis.
# Sofia 020217 Regression and model validation exercise
# title: New R script file for Analysis created 310117
# author: Sofia Oja
# date: 31 January 2017
# read the data into memory
learning2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=TRUE, sep=",")
# see the structure of the new dataset
# the learning2014 data set has 166 obs. of 7 variables: gender, age, attitude, deep, stra, surf, points
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
dim(learning2014)
## [1] 166 7
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
# Variables resembling a normal distribution: attitude, deep, stra, surf. The strongest correlation is between points and attitude (0.437); surf also correlates negatively with deep (-0.324) and, more weakly, with stra (-0.161)
p
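As a variation, the same plot matrix can be colored by gender to compare the groups; a minimal sketch (the col and alpha aesthetics are my addition, not part of the chunk above):
# the same plot matrix, with points and distributions colored by gender
p2 <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2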
# print out a summary
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
# regression model with multiple explanatory variables
# Points correlate positively with attitude (0.437) and stra (0.146), and negatively with surf (-0.144). To improve the linear model, the surf variable is excluded.
#my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
my_model <- lm(points ~ attitude + stra, data = learning2014)
# print out a summary of the model
# The relationship between the chosen explanatory variables and the target variable is statistically significant for attitude (p < 0.001, '***') but not for stra (p ≈ 0.09, '.'). The multiple R-squared of the model is 0.2048, i.e. the model explains only about 20% of the variance in points, so a linear model might not be the optimal model for this data. (Ref. Wikipedia: in statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables.)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
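Whether dropping surf was justified can be checked by comparing the reduced model to the commented-out full model with a partial F-test; a small sketch:
# refit the full three-variable model and compare it to the reduced model
full_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# a large p-value for the comparison would confirm that surf adds little explanatory power
anova(my_model, full_model)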
# diagnostic plots using the plot() function: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage (plots 1, 2 and 5)
# Some outliers are visible in the diagnostic plots; e.g. in the Normal Q-Q plot the points deviate from the line at both tails, so a linear model might not be the optimal model for this data.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
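The outliers seen in the plots can also be located numerically; a minimal sketch using standardized residuals and Cook's distances from base R:
# observations whose standardized residual is beyond +/- 2
which(abs(rstandard(my_model)) > 2)
# the most influential observations by Cook's distance
head(sort(cooks.distance(my_model), decreasing = TRUE))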
This week I have learned Logistic Regression.
# Sofia 100217 Logistic Regression exercise
# title: RStudio Exercise 3 RMarkdown file for Analysis created 100217
# author: Sofia Oja
# date: 10 February 2017
# set working directory
setwd("~/GitHub/IODS-project")
# read CSV into R
alcjoined <- read.csv(file="Data/alcjoinedmodified.csv", header=TRUE, sep=";")
# structure of the new dataset
# 'data.frame': 382 obs. of 36 variables:
str(alcjoined)
## 'data.frame': 382 obs. of 36 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : Factor w/ 9 levels "1","1,5","2",..: 1 1 4 1 2 2 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
# dimension of the new dataset
# [1] 382 36
dim(alcjoined)
## [1] 382 36
# colnames of the new dataset
# [1] "X" "school" "sex" "age" "address"
# [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
# [11] "Fjob" "reason" "nursery" "internet" "guardian"
# [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
# [21] "paid" "activities" "higher" "romantic" "famrel"
# [26] "freetime" "goout" "Dalc" "Walc" "health"
# [31] "absences" "G1" "G2" "G3" "alc_use"
# [36] "high_use"
colnames(alcjoined)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
# My hypothesis is that absences, sex, failures and freetime are related to alcohol consumption.
# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# draw a bar plot of each variable
gather(alcjoined) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped
library(ggplot2)
# initialise a plot of high_use and absences
g1 <- ggplot(alcjoined, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
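A numerical counterpart to the boxplot, sketched with dplyr group summaries:
# mean absences by sex and alcohol use
alcjoined %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))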
# find the model with glm()
m <- glm(high_use ~ failures + absences + sex + freetime, data = alcjoined, family = "binomial")
# print out a summary of the model
m
##
## Call: glm(formula = high_use ~ failures + absences + sex + freetime,
## family = "binomial", data = alcjoined)
##
## Coefficients:
## (Intercept) failures absences sexM freetime
## -2.75479 0.43004 0.09461 0.84223 0.27453
##
## Degrees of Freedom: 381 Total (i.e. Null); 377 Residual
## Null Deviance: 465.7
## Residual Deviance: 419.5 AIC: 429.5
# print out the coefficients and summary of the model
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -2.75479 0.46053 -5.982 2.21e-09 ***
#failures 0.43004 0.19312 2.227 0.025961 *
#absences 0.09461 0.02280 4.150 3.33e-05 ***
#sexM 0.84223 0.24705 3.409 0.000652 ***
#freetime 0.27453 0.12543 2.189 0.028621 *
# Absences and sex are highly significant ('***'), and failures and freetime are also significant ('*').
coef(m)
## (Intercept) failures absences sexM freetime
## -2.75478571 0.43003682 0.09460961 0.84223122 0.27453170
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + freetime,
## family = "binomial", data = alcjoined)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9664 -0.8170 -0.6065 1.0585 2.0330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.75479 0.46053 -5.982 2.21e-09 ***
## failures 0.43004 0.19312 2.227 0.025961 *
## absences 0.09461 0.02280 4.150 3.33e-05 ***
## sexM 0.84223 0.24705 3.409 0.000652 ***
## freetime 0.27453 0.12543 2.189 0.028621 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 419.50 on 377 degrees of freedom
## AIC: 429.5
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.06362265 0.02504274 0.1528626
## failures 1.53731413 1.05427814 2.2602725
## absences 1.09922964 1.05343261 1.1522505
## sexM 2.32154106 1.43694673 3.7924512
## freetime 1.31591429 1.03172262 1.6889201
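Since an odds ratio of 1 means no association, the table can be used to check which variables' 95% intervals exclude 1 (all the associations here are positive, so it suffices to look at the lower bounds); a small sketch:
# keep the rows whose lower confidence bound is above 1
OR_table <- cbind(OR, CI)
OR_table[OR_table[, "2.5 %"] > 1, ]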
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alcjoined <- mutate(alcjoined, probability = probabilities)
# use the probabilities to make a prediction of high_use
# use the probabilities to make a prediction of high_use
alcjoined <- mutate(alcjoined, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alcjoined$high_use, prediction = alcjoined$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 84 30
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alcjoined, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alcjoined$high_use, prediction = alcjoined$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.88219895 0.11780105 1.00000000
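The off-diagonal proportions above can be condensed into one training-error number; a sketch of a simple average-prediction-error loss function (for the table above it equals 0.039 + 0.220, about 0.26):
# loss function: the mean of incorrect predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# proportion of wrong predictions in the training data
loss_func(class = alcjoined$high_use, prob = alcjoined$probability)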
This week I have learned Clustering and Classification.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
# 506 obs. of 14 variables
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2); library(corrplot)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
# High correlation between the rad and tax variables; also correlation between indus and nox, indus and tax, and nox and age.
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
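scale() subtracts each column mean and divides by the column standard deviation, which is why every scaled mean above is exactly zero; a quick sketch verifying this for crim:
# manual standardization of crim should match the scale() result
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, as.numeric(boston_scaled[, "crim"]))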
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# print the new factor crime
crime
## [1] low low low low low low med_low
## [8] med_low med_low med_low med_low med_low med_low med_high
## [15] med_high med_high med_high med_high med_high med_high med_high
## [22] med_high med_high med_high med_high med_high med_high med_high
## [29] med_high med_high med_high med_high med_high med_high med_high
## [36] low med_low low med_low low low med_low
## [43] med_low med_low med_low med_low med_low med_low med_low
## [50] med_low med_low low low low low low
## [57] low low med_low med_low med_low med_low med_low
## [64] med_low low low low low med_low med_low
## [71] med_low med_low med_low med_low low med_low med_low
## [78] med_low low med_low low low low low
## [85] low low low low low low low
## [92] low low low low med_low med_low med_low
## [99] low low med_low med_low med_low med_low med_low
## [106] med_low med_low med_low med_low med_high med_low med_low
## [113] med_low med_low med_low med_low med_low med_low med_low
## [120] med_low low low med_low med_low med_low med_low
## [127] med_high med_high med_high med_high med_high med_high med_high
## [134] med_high med_high med_high med_high med_high med_low med_high
## [141] med_high med_high med_high high med_high med_high med_high
## [148] med_high med_high med_high med_high med_high med_high med_high
## [155] med_high med_high med_high med_high med_high med_high med_high
## [162] med_high med_high med_high med_high med_high med_high med_high
## [169] med_high med_high med_high med_high med_low med_low med_low
## [176] low low low low low low low
## [183] med_low med_low med_low low low low med_low
## [190] med_low med_low low med_low low low low
## [197] low low low low low low low
## [204] low low med_low med_low med_low med_low med_high
## [211] med_low med_high med_low med_low med_high med_low low
## [218] low med_low med_low med_high med_high med_high med_high
## [225] med_high med_high med_high med_high med_high med_high med_high
## [232] med_high med_high med_high med_high med_high med_high med_high
## [239] med_low med_low med_low med_low med_low med_low med_low
## [246] med_low med_high med_low med_low med_low med_low med_low
## [253] med_low med_high low low low med_high med_high
## [260] med_high med_high med_high med_high med_high med_high med_high
## [267] med_high med_high med_high med_low med_high med_low med_low
## [274] med_low low med_low med_low low low med_low
## [281] low low low low low low low
## [288] low low low low low low med_low
## [295] low med_low low med_low low low low
## [302] low med_low med_low low low low low
## [309] med_high med_high med_high med_high med_high med_high med_high
## [316] med_low med_high med_low med_high med_high med_low med_low
## [323] med_high med_high med_high med_low med_high med_low low
## [330] low low low low low low low
## [337] low low low low low low low
## [344] low low low low low low low
## [351] low low low low low med_low high
## [358] high high high high high high high
## [365] med_high high high high high high high
## [372] high high high high high high high
## [379] high high high high high high high
## [386] high high high high high high high
## [393] high high high high high high high
## [400] high high high high high high high
## [407] high high high high high high high
## [414] high high high high high high high
## [421] high high high high high high high
## [428] high high high high high high high
## [435] high high high high high high high
## [442] high high high high high high high
## [449] high high high high high high high
## [456] high high high high high high high
## [463] high high high med_high high high high
## [470] high high high med_high high high high
## [477] high high high high high high high
## [484] med_high med_high med_high high high med_low med_low
## [491] med_low med_low med_low med_low med_high med_low med_high
## [498] med_high med_low med_low med_low low low low
## [505] med_low low
## Levels: low med_low med_high high
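Because the breaks are quantiles, the four classes should each hold roughly a quarter of the 506 observations; a quick check:
# class counts of the new factor
table(crime)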
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ . , data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2376238 0.2500000 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.9779328 -0.9710340 -0.12234430 -0.8879052 0.4770923
## med_low -0.1080455 -0.3062293 0.01475116 -0.5897675 -0.1309820
## med_high -0.3845049 0.1384268 0.23442641 0.2986077 0.1453678
## high -0.4872402 1.0171096 -0.04073494 1.0438750 -0.4092362
## age dis rad tax ptratio
## low -0.8973600 0.8610163 -0.6876449 -0.7569282 -0.4862369
## med_low -0.3597653 0.3654231 -0.5488034 -0.5160687 -0.1161077
## med_high 0.3479152 -0.3411418 -0.3860329 -0.3139980 -0.2721533
## high 0.7804655 -0.8461560 1.6382099 1.5141140 0.7808718
## black lstat medv
## low 0.38255181 -0.79336163 0.57208001
## med_low 0.32668561 -0.12304729 -0.01466654
## med_high 0.08317547 -0.03327484 0.22325857
## high -0.85961187 0.84012548 -0.65317475
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11235374 0.77807522 -0.7763944
## indus 0.07790208 -0.48663586 0.3835195
## chas -0.07622644 -0.06285565 0.1011336
## nox 0.40595925 -0.52436832 -1.5440427
## rm -0.10628779 -0.10215200 -0.1333251
## age 0.25116556 -0.19818320 -0.1517011
## dis -0.04753983 -0.23899100 0.1185567
## rad 2.83053400 1.03250053 0.1190709
## tax 0.10327671 -0.07634536 0.4582850
## ptratio 0.12033631 0.06823507 -0.3082017
## black -0.13577494 -0.01474867 0.1152250
## lstat 0.24085978 -0.29602651 0.3515430
## medv 0.22234761 -0.33848447 -0.3108624
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9437 0.0417 0.0146
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 10)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
# Most observations fall on the diagonal of the cross tabulation, so the LDA model performs well when predicting on the new data.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 6 3 0
## med_low 7 14 9 0
## med_high 0 6 19 0
## high 0 0 0 25
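The cross tabulation can be summarized as an overall accuracy; a small sketch (the exact value depends on the random train/test split, about 0.70 for the table shown above):
# proportion of correctly classified test observations
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)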
# load the data
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
# k-means clustering
km <- kmeans(dist_eu, centers = 5)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
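Besides the WCSS elbow, the chosen number of clusters can be checked with average silhouette widths; a sketch assuming the cluster package is available:
# average silhouette width: values closer to 1 indicate well-separated clusters
library(cluster)
sil <- silhouette(km$cluster, dist_eu)
summary(sil)$avg.width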
# Bonus Exercise:
# access the MASS package
library(MASS)
# load the data
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# k-means clustering
km <- kmeans(dist_eu, centers = 5)
# linear discriminant analysis
lda.fit <- lda(km$cluster ~ . , data = boston_scaled)
# print the lda.fit object
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4 5
## 0.18181818 0.04940711 0.30830040 0.23715415 0.22332016
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3951599 1.49373012 -1.0280299 0.24120575 -0.79647759 1.13584087
## 2 2.3352128 -0.48724019 1.0667704 1.30251103 1.26154533 -0.21940026
## 3 -0.4023867 -0.08196879 -0.6708441 -0.27232907 -0.71691589 0.02477035
## 4 0.6361371 -0.48724019 1.0979228 -0.14109239 1.22360302 -0.52042480
## 5 -0.3149534 -0.47775410 0.3611543 0.04124529 0.05967983 -0.35774825
## age dis rad tax ptratio black
## 1 -0.8400953 0.9065514 -0.6086194 -0.6502772 -0.9991671 0.3389339
## 2 0.9111949 -1.0255574 1.1359019 1.1548960 0.1960633 -1.4567920
## 3 -0.6577632 0.5759944 -0.5681286 -0.7193879 -0.1594854 0.3766108
## 4 0.8391434 -0.8692421 1.3447315 1.3861705 0.5748257 -0.6192283
## 5 0.4993164 -0.3832732 -0.3995070 -0.2049809 0.3798444 0.1840176
## lstat medv
## 1 -0.9505678 1.2087711
## 2 0.9594283 -0.3815121
## 3 -0.5318895 0.1487245
## 4 0.9216212 -0.7473700
## 5 0.3172279 -0.3113784
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## crim 0.19491189 -0.7127609 -1.14924280 -0.4608763
## zn 0.35812428 -1.0396097 0.41431106 0.9408250
## indus 0.37857256 0.4320719 -0.20011161 0.3773144
## chas 0.29883290 -0.4237586 -0.56277369 0.0735374
## nox 1.33917666 -0.7169740 0.23530741 -0.3177065
## rm -0.17927152 -0.4166834 0.10365939 0.2117328
## age 0.15598107 0.1800961 -0.11147413 0.7095433
## dis -0.03700046 -0.2845778 -0.08357663 -0.5600861
## rad 0.16954940 0.4993418 1.27959574 -1.1399167
## tax 1.23874792 -0.7752553 0.04457017 0.4981892
## ptratio 0.34173317 0.1200696 -0.24570231 0.4987949
## black -0.25233271 0.2567087 0.35218105 0.1031269
## lstat 0.55251648 -0.4613707 -0.02975811 0.4062116
## medv 0.29066860 -0.9089058 -0.26660104 0.2170107
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.7810 0.1546 0.0369 0.0275
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results
# Variables nox, tax, zn, medv and crim are the most influential linear separators for the clusters.
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 5)
# Super-Bonus Exercise:
# linear discriminant analysis
lda.fit <- lda(crime ~ . , data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Create a 3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# the same plot with points colored by the crime categories: "low", "med_low", "med_high", "high"
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
This week I have learned Dimensionality Reduction Techniques.
# load the data
# read CSV into R
human_ <- read.csv(file="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header=TRUE, sep=",")
# explore the human dataset
# human data: 155 obs. of 8 variables
# The data combines several indicators from most countries in the world
# "Country" = Country name
# # Health and knowledge
# "GNI" = Gross National Income per capita
# "Life.Exp" = Life expectancy at birth
# "Edu.Exp" = Expected years of schooling
# "Mat.Mor" = Maternal mortality ratio
# "Ado.Birth" = Adolescent birth rate
# # Empowerment
# "Parli.F" = Percetange of female representatives in parliament
# "Edu2.F" = Proportion of females with at least secondary education
# "Edu2.M" = Proportion of males with at least secondary education
# "Labo.F" = Proportion of females in the labour force
# "Labo.M" " Proportion of males in the labour force
# "Edu2.FM" = Edu2.F / Edu2.M
# "Labo.FM" = Labo2.F / Labo2.M
str(human_)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human_)
## [1] 155 8
summary(human_)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# Access tidyr, dplyr, GGally, corrplot
library(tidyr); library(dplyr); library(GGally); library(corrplot)
# visualize the 'human_' variables
# Edu.Exp is the most normally distributed of the variables.
ggpairs(human_)
# compute the correlation matrix and visualize it with corrplot
# Strong positive correlations between Edu.Exp and Life.Exp and between Mat.Mor and Ado.Birth; strong negative correlation between Mat.Mor and Life.Exp.
cor(human_) %>% corrplot()
# principal component analysis (PCA) on the not standardized human data
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
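PC1 captures essentially all of the variance here because GNI is on a vastly larger scale than the other variables; a quick sketch comparing the standard deviations:
# GNI's standard deviation dwarfs the others, so it dominates the unscaled PCA
apply(human_, 2, sd) %>% round(1)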
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
# principal component analysis (PCA) for standardized human data
# standardize the variables
human_std <- scale(human_)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
# The results differ clearly between the unstandardized and the standardized data: without standardization GNI dominates and PC1 captures essentially all of the variance, whereas after standardization the variance spreads across the components (PC1 53.6%, PC2 16.2%).
# PC1 positive components are health and knowledge variables: Mat.Mor and Ado.Birth
# PC1 negative components are health and knowledge variables: Edu.Exp and Life.Exp
# PC2 positive components are empowerment variables: Labo.FM and Parli.F
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
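The variable arrows read off the biplot can be confirmed from the rotation matrix returned by prcomp; a small sketch:
# loadings of the original variables on the first two standardized PCs
round(pca_human$rotation[, 1:2], 2)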
# access the FactoMineR package
library(FactoMineR)
# load the data
data("tea")
# explore the dataset
# 300 obs. of 36 variables
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# Access ggplot2
library(ggplot2)
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + facet_wrap("key", scales = "free")
## Warning: attributes are not identical across measure variables; they will
## be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
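The eigenvalue table in the summary can also be drawn as a scree-style barplot; a minimal sketch using the eig matrix that MCA returns:
# percentage of variance captured by each MCA dimension
barplot(mca$eig[, "percentage of variance"], names.arg = rownames(mca$eig), las = 2, ylab = "% of variance")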
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
# draw a biplot
#biplot(mca, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = NA, ylab = NA)
# draw a biplot of the principal component representation and the original variables
#biplot(mca, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))